NYC Restaurant Inspection Data Analysis

Shivaram Karandikar

Outline

  • Questions of Interest
  • Data
    • Scope
    • Preparation
  • Visualization
  • Analysis
    • Hypothesis Testing
    • Random Forest Regression
  • Conclusion

Questions of Interest

  • How does the distribution of inspection scores differ between boroughs?
  • What are the factors that contribute to a restaurant’s inspection score?

Previous Work

  • Moraes, R.M., Bari, A., Zhu, J. (2019)
    • Used NYC Restaurant Inspection Data together with crime data to predict apartment prices in NYC.
      • Found that recurrent neural networks trained on the inspection and crime data predicted apartment prices more accurately than ARIMA (Auto-Regressive Integrated Moving Average) models.
  • What can we do differently?
    • Inspection data has been successfully applied to an external problem, but how can we use it to better understand the restaurant inspection process itself?
    • Can we use the data to predict inspection scores?
      • Does the type of inspection affect the score?
      • How do the types and number of violations affect the score?
      • Time/Day, Location (Borough), Cuisine, etc.

Data

Scope

  • NYC OpenData: DOHMH New York City Restaurant Inspection Results
    • January 1st, 2022 to December 31st, 2022
    • 92,927 rows, 32 columns
      • Each row represents a single violation for a restaurant.
    • Variables of Interest: ‘SCORE’ and ‘GRADE’
      • ‘SCORE’ is the sum of all violation points for the inspection.
      • ‘GRADE’ is the letter grade based on the score:
        • A: 0 <= ‘SCORE’ <= 13
        • B: 14 <= ‘SCORE’ <= 27
        • C: 28 <= ‘SCORE’
CAMIS DBA BORO BUILDING STREET ZIPCODE PHONE CUISINE DESCRIPTION INSPECTION DATE ACTION VIOLATION CODE VIOLATION DESCRIPTION CRITICAL FLAG SCORE GRADE GRADE DATE RECORD DATE INSPECTION TYPE Latitude Longitude Community Board Council District Census Tract BIN BBL NTA Location Point1 Zip Codes Community Districts Borough Boundaries City Council Districts Police Precincts
0 50112431 OCEANIC BOIL Queens 9618 QUEENS BLVD 11374.0 3478320123 Seafood 01/03/2022 Violations were cited in the following area(s). 02B Hot food item not held at or above 140º F. Critical 20.0 N NaN 03/23/2023 Pre-permit (Operational) / Initial Inspection 40.729887 -73.861724 406.0 29.0 69300.0 4072129.0 4.030820e+09 QN18 NaN NaN NaN NaN NaN NaN
1 40367534 MR BROADWAY KOSHER RESTAURANT Manhattan 1372 BROADWAY 10018.0 2129212152 Jewish/Kosher 01/03/2022 Violations were cited in the following area(s). 10F Non-food contact surface improperly constructe... Not Critical 13.0 A 01/03/2022 03/23/2023 Cycle Inspection / Initial Inspection 40.752263 -73.987454 105.0 4.0 10900.0 1080609.0 1.008130e+09 MN17 NaN NaN NaN NaN NaN NaN
2 40390536 TASTE OF INDIA II RESTAURANT Staten Island 287 NEWDORP LN NaN 7189874700 Indian 01/03/2022 Violations were cited in the following area(s). 02G Cold food item held above 41º F (smoked fish a... Critical 13.0 A 01/03/2022 03/23/2023 Cycle Inspection / Initial Inspection 0.000000 0.000000 NaN NaN NaN NaN 5.000000e+00 NaN NaN NaN NaN NaN NaN NaN
3 50118245 NEW LUCK GARDEN Manhattan 1954 AMSTERDAM AVENUE 10032.0 2122835533 Chinese 01/03/2022 Violations were cited in the following area(s). 04L Evidence of mice or live mice present in facil... Critical 21.0 NaN NaN 03/23/2023 Pre-permit (Operational) / Initial Inspection 40.832838 -73.942062 112.0 7.0 24100.0 1062713.0 1.021150e+09 MN36 NaN NaN NaN NaN NaN NaN
4 50105699 LIBERTY COFFEE SHOP Queens 8806 LIBERTY AVE 11417.0 7189476590 Coffee/Tea 01/03/2022 Violations were cited in the following area(s). 10H Proper sanitization not provided for utensil w... Not Critical 4.0 NaN NaN 03/23/2023 Cycle Inspection / Initial Inspection 40.679888 -73.850478 410.0 32.0 5400.0 4190609.0 4.091530e+09 QN56 NaN NaN NaN NaN NaN NaN
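The grade thresholds listed above can be written as a small function. This is only a sketch of the stated score ranges: the function name `grade_from_score` is hypothetical, and the actual ‘GRADE’ column can hold other values (the first sample row has a score of 20.0 but a grade of N).

```python
def grade_from_score(score: float) -> str:
    """Map an inspection score to a letter grade using the stated ranges."""
    if score < 0:
        raise ValueError("score must be non-negative")
    if score <= 13:
        return "A"  # A: 0 <= SCORE <= 13
    if score <= 27:
        return "B"  # B: 14 <= SCORE <= 27
    return "C"      # C: 28 <= SCORE


# Scores taken from the sample rows shown in this deck.
for score in (4.0, 13.0, 21.0, 37.0):
    print(score, grade_from_score(score))
```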

Preparation

  • Variables irrelevant to analysis are removed.
  • Missing/Incorrect Values:
    • Missing zip code values filled using the uszipcode package.
      • Unspecified boroughs imputed using the zip code.
    • Rows with missing ‘SCORE’ and ‘GRADE’ values are removed.
    • Incorrect ‘VIOLATION CODE’ values are removed.
CAMIS DBA BORO BUILDING STREET ZIPCODE PHONE CUISINE DESCRIPTION INSPECTION DATE ACTION VIOLATION CODE VIOLATION DESCRIPTION CRITICAL FLAG SCORE GRADE GRADE DATE INSPECTION TYPE Latitude Longitude Community Board Council District Census Tract BIN BBL NTA Violation_Code Health_Code Violation_Summary Category_Description Violation_Template Condition I Condition II Condition III Condition IV Condition V VIOLATION COUNT CRITICAL COUNT DAY TYPE COOKING HOT HOLDING REHEATING & HOT HOLDING COLD HOLDING REDUCE OXYGEN PACKAGE COOLING & REFRIGERATION UNAPPROVED SOURCE FOOD PROTECTION ADULTERATED PLUMBING CONTAMINATION PERMIT/FPC FOOD WORKERS HACCP PLAN TEMPERATURE REGULATING PEST CONTROL LIGHT, HEAT & VENTILATION MAINTENANCE, CONSTRUCTION & PLACEMENT HANDWASH/TOILET WAREWASHING UTENSILS SIGNS
0 40367534 MR BROADWAY KOSHER RESTAURANT Manhattan 1372 BROADWAY 10018.0 2129212152 Jewish/Kosher 01/03/2022 Violation 10F Non-food contact surface improperly constructe... 0 13.0 A 01/03/2022 Cycle Inspection / Initial Inspection 40.752263 -73.987454 105.0 4.0 10900.0 1080609.0 1.008130e+09 MN17 10F NYCHC 81.17(e)(1) Flooring improperly constructed/maintained MAINTENANCE, CONSTRUCTION & PLACEMENT Flooring improperly constructed or maintained ... One non-food contact surface improperly constr... Two non-food contact surfaces improperly const... Three non-food contact surfaces improperly con... Four non-food contact surfaces improperly cons... NaN 3 1 Weekday 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0
1 40390536 TASTE OF INDIA II RESTAURANT Staten Island 287 NEWDORP LN 97003.0 7189874700 Indian 01/03/2022 Violation 02G Cold food item held above 41º F (smoked fish a... 1 13.0 A 01/03/2022 Cycle Inspection / Initial Inspection 0.000000 0.000000 NaN NaN NaN NaN 5.000000e+00 NaN 02G NYCHC 81.09(a) PHF held above 41°F COLD HOLDING Potentially hazardous cold food, other than pr... One cold food item out of temperature in one a... Two cold food items out of temperature or the ... Three cold food items out of temperature. Exam... Four cold food items out of temperature. Exam... Failure to correct any conditions of a PHH at ... 2 2 Weekday 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
2 40370436 ROSSINI'S Manhattan 108 EAST 38 STREET 10016.0 2126830135 Italian 01/03/2022 Violation 10E Accurate thermometer not provided in refrigera... 0 12.0 A 01/03/2022 Cycle Inspection / Initial Inspection 40.749349 -73.978941 106.0 4.0 8000.0 1019118.0 1.008930e+09 MN20 10E NYCHC 81.18(a)(3) Thermometers: cold storage/refrigerator TEMPERATURE REGULATING A numerically scaled temperature monitoring de... One refrigeration or hot holding unit not prov... Two refrigeration or hot holding units not pro... Three refrigeration or hot holding units not p... Four refrigeration or hot holdings units not p... NaN 4 1 Weekday 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 2 0 0 0 0
3 50117909 JACKS COFFEE Manhattan 138 WEST 10 STREET 10014.0 6469644182 Coffee/Tea 01/03/2022 Violation 10I Single service item reused, improperly stored,... 0 7.0 A 01/03/2022 Pre-permit (Operational) / Initial Inspection 40.734582 -74.000595 102.0 3.0 7100.0 1010684.0 1.006100e+09 MN23 10I NYCHC 81.07(o) Single service articles improperly stored/reused UTENSILS Single service articles improperly <USED/STORE... Single service item reused, improperly stored,... Single service item reused, improperly stored,... Single service item reused, improperly stored,... Single service item reused, improperly stored,... NaN 2 1 Weekday 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0
4 50107891 SLICE BROADWAY Queens 3812 BROADWAY 11103.0 9176935593 Pizza 01/03/2022 Violation 10F Non-food contact surface improperly constructe... 0 37.0 C 01/03/2022 Pre-permit (Operational) / Re-inspection 40.759276 -73.919677 401.0 22.0 15500.0 4010408.0 4.006560e+09 QN70 10F NYCHC 81.17(e)(1) Flooring improperly constructed/maintained MAINTENANCE, CONSTRUCTION & PLACEMENT Flooring improperly constructed or maintained ... One non-food contact surface improperly constr... Two non-food contact surfaces improperly const... Three non-food contact surfaces improperly con... Four non-food contact surfaces improperly cons... NaN 7 4 Weekday 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 2 0 2 0 0 0 0
  • Joined restaurant inspection data with violation code data from the Violation Health Code Mapping dataset.
    • Includes important categorical information about each violation.
  • Created new variables:
    • ‘VIOLATION COUNT’: Number of violations for each inspection.
    • ‘CRITICAL COUNT’: Number of critical violations for each inspection. A critical violation is one that could cause foodborne illness.
    • ‘DAY TYPE’: Whether the inspection was on a weekday or weekend.
    • 23 new variables for each violation category.
      • Each variable represents the number of violations in each category for each inspection.
  • Data is now grouped by restaurant and inspection date.
    • Each row represents a single inspection for a restaurant.
CAMIS DBA CUISINE DESCRIPTION BORO INSPECTION DATE SCORE GRADE INSPECTION TYPE VIOLATION COUNT CRITICAL COUNT DAY TYPE COOKING HOT HOLDING REHEATING & HOT HOLDING COLD HOLDING REDUCE OXYGEN PACKAGE COOLING & REFRIGERATION UNAPPROVED SOURCE FOOD PROTECTION ADULTERATED PLUMBING CONTAMINATION PERMIT/FPC FOOD WORKERS HACCP PLAN TEMPERATURE REGULATING PEST CONTROL LIGHT, HEAT & VENTILATION MAINTENANCE, CONSTRUCTION & PLACEMENT HANDWASH/TOILET WAREWASHING UTENSILS SIGNS
0 40367534 MR BROADWAY KOSHER RESTAURANT Jewish/Kosher Manhattan 01/03/2022 13.0 A Cycle Inspection / Initial Inspection 3 1 Weekday 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0
1 40390536 TASTE OF INDIA II RESTAURANT Indian Staten Island 01/03/2022 13.0 A Cycle Inspection / Initial Inspection 2 2 Weekday 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
2 40370436 ROSSINI'S Italian Manhattan 01/03/2022 12.0 A Cycle Inspection / Initial Inspection 4 1 Weekday 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 2 0 0 0 0
3 50117909 JACKS COFFEE Coffee/Tea Manhattan 01/03/2022 7.0 A Pre-permit (Operational) / Initial Inspection 2 1 Weekday 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0
4 50107891 SLICE BROADWAY Pizza Queens 01/03/2022 37.0 C Pre-permit (Operational) / Re-inspection 7 4 Weekday 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 2 0 2 0 0 0 0
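The grouping step described above could look roughly like the following pandas sketch. The toy rows are invented for illustration, ‘CRITICAL FLAG’ is assumed to be already encoded as 0/1 (as in the prepared table), and the exact aggregation used in the original notebook is not shown in the source.

```python
import pandas as pd

# Toy violation-level data: one row per violation (values are made up).
violations = pd.DataFrame({
    "CAMIS": [1, 1, 1, 2, 2],
    "INSPECTION DATE": ["01/03/2022"] * 3 + ["01/08/2022"] * 2,
    "SCORE": [13.0, 13.0, 13.0, 20.0, 20.0],
    "CRITICAL FLAG": [0, 1, 0, 1, 1],
})
violations["INSPECTION DATE"] = pd.to_datetime(
    violations["INSPECTION DATE"], format="%m/%d/%Y"
)

# Collapse to one row per restaurant and inspection date.
inspections = violations.groupby(
    ["CAMIS", "INSPECTION DATE"], as_index=False
).agg(**{
    "SCORE": ("SCORE", "first"),                 # same score on every row
    "VIOLATION COUNT": ("CRITICAL FLAG", "size"),
    "CRITICAL COUNT": ("CRITICAL FLAG", "sum"),
})

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are weekend days.
inspections["DAY TYPE"] = inspections["INSPECTION DATE"].dt.dayofweek.map(
    lambda d: "Weekend" if d >= 5 else "Weekday"
)
print(inspections)
```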

Visualization

[Three ggplot figures; images not available.]

Analysis

Hypothesis Testing

  • The Kruskal-Wallis test is used to compare the distributions of SCORE for each borough.
  • The hypotheses are as follows:
    • \(H_0\): The distributions of SCORE are the same in all boroughs.
    • \(H_1\): The distribution of SCORE differs in at least one borough.
BORO            count       mean       std  min  25%   50%   75%    max
Bronx          1179.0  10.756573  6.852753  0.0  7.0  11.0  13.0   73.0
Brooklyn       4331.0  10.501039  7.217613  0.0  7.0  10.0  12.0   75.0
Manhattan      5960.0  10.295134  7.133162  0.0  7.0  10.0  12.0  100.0
Queens         3450.0  10.983768  7.283775  0.0  7.0  11.0  13.0   77.0
Staten Island   638.0  10.945141  5.201600  0.0  9.0  11.0  13.0   60.0
KruskalResult(statistic=63.67589952287163, pvalue=4.890239593852517e-13)

The null hypothesis is rejected at \(\alpha = 0.01\).
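A minimal sketch of how such a test can be run with `scipy.stats.kruskal`. The scores below are synthetic (drawn from Poisson distributions purely for illustration), so the printed statistic will not match the result above.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(19)

# Synthetic stand-ins for the per-borough SCORE samples.
scores_by_borough = {
    "Bronx": rng.poisson(11, size=200),
    "Brooklyn": rng.poisson(10, size=200),
    "Manhattan": rng.poisson(10, size=200),
    "Queens": rng.poisson(11, size=200),
    "Staten Island": rng.poisson(11, size=200),
}

# kruskal takes one array per group as positional arguments.
result = kruskal(*scores_by_borough.values())
alpha = 0.01
reject_null = bool(result.pvalue < alpha)
print(result.statistic, result.pvalue, reject_null)
```

With the real data, the groups would presumably come from something like `[g["SCORE"] for _, g in df.groupby("BORO")]`, where `df` is the inspection-level table.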

  • Dunn’s Test is a post-hoc test which performs multiple pairwise comparisons between the distributions of SCORE for each borough.
  • The hypotheses are as follows:
    • \(H_0\): The distributions of SCORE for borough \(i\) and borough \(j\) are the same.
    • \(H_1\): The distributions of SCORE for borough \(i\) and borough \(j\) are different.
Pairwise p-values:

                  Bronx  Brooklyn     Manhattan        Queens  Staten Island
Bronx          1.000000  0.067104  5.229297e-03  1.000000e+00   5.785141e-02
Brooklyn       0.067104  1.000000  1.000000e+00  1.966773e-04   1.168076e-06
Manhattan      0.005229  1.000000  1.000000e+00  2.712457e-07   3.413853e-08
Queens         1.000000  0.000197  2.712457e-07  1.000000e+00   3.143599e-02
Staten Island  0.057851  0.000001  3.413853e-08  3.143599e-02   1.000000e+00

Significant differences (True = null hypothesis rejected):

               Bronx  Brooklyn  Manhattan  Queens  Staten Island
Bronx          False     False       True   False          False
Brooklyn       False     False      False    True           True
Manhattan       True     False      False    True           True
Queens         False      True       True   False           True
Staten Island  False      True       True    True          False
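The True/False table can be derived from the p-values by thresholding. The sketch below assumes a significance level of 0.05, which is not stated in the source but reproduces the table shown. The p-values are copied from the Dunn's test output above.

```python
import pandas as pd

boroughs = ["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"]

# Pairwise p-values as reported by Dunn's test above.
p_values = pd.DataFrame(
    [
        [1.000000, 0.067104, 5.229297e-03, 1.000000e+00, 5.785141e-02],
        [0.067104, 1.000000, 1.000000e+00, 1.966773e-04, 1.168076e-06],
        [5.229297e-03, 1.000000, 1.000000e+00, 2.712457e-07, 3.413853e-08],
        [1.000000, 1.966773e-04, 2.712457e-07, 1.000000e+00, 3.143599e-02],
        [5.785141e-02, 1.168076e-06, 3.413853e-08, 3.143599e-02, 1.000000e+00],
    ],
    index=boroughs,
    columns=boroughs,
)

alpha = 0.05  # assumed threshold, not given in the source
significant = p_values < alpha
print(significant)
```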

Random Forest Regression

  • Random Forest Regression is used to predict the inspection score.
  • Preprocessing steps include:
    • Dummy variables are created for categorical variables.
    • ‘GRADE’ is converted to a numeric variable using ordinal encoding.
    • The data is split into training and testing sets.
SCORE GRADE VIOLATION COUNT CRITICAL COUNT COOKING HOT HOLDING REHEATING & HOT HOLDING COLD HOLDING REDUCE OXYGEN PACKAGE COOLING & REFRIGERATION UNAPPROVED SOURCE FOOD PROTECTION ADULTERATED PLUMBING CONTAMINATION PERMIT/FPC FOOD WORKERS HACCP PLAN TEMPERATURE REGULATING PEST CONTROL LIGHT, HEAT & VENTILATION MAINTENANCE, CONSTRUCTION & PLACEMENT HANDWASH/TOILET WAREWASHING UTENSILS SIGNS BORO_Bronx BORO_Brooklyn BORO_Manhattan BORO_Queens BORO_Staten Island DAY TYPE_Weekday DAY TYPE_Weekend INSPECTION TYPE_Cycle INSPECTION TYPE_Pre-permit
0 13.0 1 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0
1 13.0 1 2 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 1 0
2 12.0 1 4 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 2 0 0 0 0 0 0 1 0 0 1 0 1 0
3 7.0 1 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 1
4 37.0 3 7 4 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 2 0 2 0 0 0 0 0 0 0 1 0 1 0 0 1
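The preprocessing steps above might be sketched as follows. The toy rows, the grade mapping (A=1, B=2, C=3, inferred from the encoded table, with B assumed), the 25% test size, and the random seed are all assumptions, not values confirmed by the source.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy inspection-level data (values are illustrative only).
inspections = pd.DataFrame({
    "SCORE": [13.0, 13.0, 12.0, 7.0, 37.0, 20.0],
    "GRADE": ["A", "A", "A", "A", "C", "B"],
    "VIOLATION COUNT": [3, 2, 4, 2, 7, 5],
    "BORO": ["Manhattan", "Staten Island", "Manhattan",
             "Manhattan", "Queens", "Bronx"],
    "DAY TYPE": ["Weekday", "Weekday", "Weekday",
                 "Weekday", "Weekday", "Weekend"],
})

# Ordinal encoding of the letter grade (assumed mapping).
grade_order = {"A": 1, "B": 2, "C": 3}
inspections["GRADE"] = inspections["GRADE"].map(grade_order)

# One dummy column per category, e.g. BORO_Manhattan, DAY TYPE_Weekday.
features = pd.get_dummies(
    inspections.drop(columns="SCORE"),
    columns=["BORO", "DAY TYPE"],
    dtype=int,
)
target = inspections["SCORE"]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.25, random_state=19
)
print(features.columns.tolist())
```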
  • The average baseline error is the mean absolute difference between the mean training score and each testing score.
Average Baseline Error: 4.26
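As a worked illustration of this baseline, the sketch below predicts the mean training score for every test inspection and averages the absolute errors. The numbers are toy values, not the real data.

```python
from statistics import mean

# Toy scores (illustrative only).
y_train = [8.0, 12.0, 10.0, 10.0]
y_test = [7.0, 13.0, 10.0, 14.0]

# The baseline always predicts the mean of the training scores.
baseline_prediction = mean(y_train)

# Mean absolute error of that constant prediction on the test set.
baseline_error = mean(abs(y - baseline_prediction) for y in y_test)
print(f"Average Baseline Error: {baseline_error:.2f}")
```

A model is only useful for prediction if its MAE is clearly below this value.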

Models are created using RandomForestRegressor from sklearn.ensemble.

The first model rf is created using \(n = 1000\) trees.

The parameters of the model are as follows:

{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': None, 'max_features': 1.0, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 1000, 'n_jobs': None, 'oob_score': False, 'random_state': 19, 'verbose': 0, 'warm_start': False}

Results:

Mean Squared Error: 5.72
Mean Absolute Error: 1.26
Root Mean Squared Error: 2.39
R^2: 0.8846
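A minimal sketch of fitting a `RandomForestRegressor` and computing the four metrics reported above. The data are synthetic and the forest uses 100 trees rather than 1000 to keep the example fast, so the numbers will not match the results shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(19)

# Synthetic count-like features and a noisy linear target.
X = rng.integers(0, 5, size=(500, 4)).astype(float)
y = X @ np.array([5.0, 3.0, 2.0, 1.0]) + rng.normal(0, 1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=19
)

rf = RandomForestRegressor(n_estimators=100, random_state=19)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)

mse = mean_squared_error(y_test, predictions)
mae = mean_absolute_error(y_test, predictions)
rmse = float(np.sqrt(mse))  # RMSE is the square root of MSE
r2 = r2_score(y_test, predictions)

print(f"Mean Squared Error: {mse:.2f}")
print(f"Mean Absolute Error: {mae:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"R^2: {r2:.4f}")
```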

The second model gs_rf is created using GridSearchCV to find the optimal parameters for the model.

The parameters of the model are as follows:

{'bootstrap': True, 'max_depth': 80, 'max_features': 3, 'min_samples_leaf': 3, 'min_samples_split': 12, 'n_estimators': 100}

Results:

Mean Squared Error: 11.50
Mean Absolute Error: 1.75
Root Mean Squared Error: 3.39
R^2: 0.7683
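A rough sketch of a grid search over random forest hyperparameters with `GridSearchCV`. The grid, the 3-fold cross-validation, the scoring metric, and the synthetic data are all assumptions chosen to keep the example small. The grid actually searched in the original notebook is not shown in the source.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (illustrative only).
X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=19)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=19
)

# Deliberately tiny grid: 2 x 2 = 4 candidate settings.
param_grid = {
    "max_depth": [10, None],
    "min_samples_leaf": [1, 3],
}

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=19),
    param_grid=param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",
)
search.fit(X_train, y_train)

# best_estimator_ is refit on the full training set with the best settings.
gs_rf = search.best_estimator_
test_mae = mean_absolute_error(y_test, gs_rf.predict(X_test))
print(search.best_params_, round(test_mae, 2))
```

The search picks the setting with the best cross-validated score on the training data, which does not guarantee the best score on the held-out test set.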

The third model rs_rf is created using RandomizedSearchCV to find the optimal parameters for the model.

The parameters of the model are as follows:

{'n_estimators': 1400, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'auto', 'max_depth': None, 'bootstrap': True}

Results:

Mean Squared Error: 5.55
Mean Absolute Error: 1.26
Root Mean Squared Error: 2.36
R^2: 0.8881
Model      MAE    MSE    RMSE  \(R^2\)
Baseline   4.26   -      -     -
rf         1.26   5.72   2.39  0.8846
gs_rf      1.75   11.50  3.39  0.7683
rs_rf      1.26   5.55   2.36  0.8881

The Randomized Search model (rs_rf) has the highest \(R^2\) value and the lowest MSE and RMSE values. Both rf and rs_rf have an MAE of 1.26, substantially lower than the baseline error of 4.26. The Grid Search model (gs_rf) performs worse than the default rf model on every metric.

The permutation importance is given below:
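A minimal sketch of how permutation importance can be computed with `sklearn.inspection.permutation_importance`. The data and feature names (`strong`, `weak`, `noise`) are hypothetical, so the ranking says nothing about the real inspection features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(19)

# Hypothetical features: the target depends strongly on the first,
# weakly on the second, and not at all on the third.
feature_names = ["strong", "weak", "noise"]
X = rng.integers(0, 8, size=(400, 3)).astype(float)
y = 4.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=19
)
model = RandomForestRegressor(n_estimators=50, random_state=19)
model.fit(X_train, y_train)

# Shuffle each feature in turn and measure the drop in test-set score.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=19
)

ranked = sorted(
    zip(feature_names, result.importances_mean, result.importances_std),
    key=lambda item: item[1],
    reverse=True,
)
for name, importance, spread in ranked:
    print(f"{name}: {importance:.3f} +/- {spread:.3f}")
```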

Conclusion

In this analysis, we explored the relationship between a restaurant's inspection score and a number of factors. Hypothesis testing showed that the distribution of scores differs between several pairs of boroughs. In addition, random forest regression models were created to predict the inspection score. The best model (rs_rf) predicted the inspection score with an \(R^2\) value of 0.8881.

There are limitations to this analysis. Errors in the data and the omission of some variables may have affected the results. In the future, it would be interesting to explore the text data, such as the business name, cuisine type, and violation description, to see whether it can be used to predict the inspection score.